-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Special dir warning #3436 #4229
Conversation
dvc/ignore.py
Outdated
special_dirs = { | ||
".petest_cache", | ||
".mypy_cache", | ||
"node_modules", | ||
".venv", | ||
"venv", | ||
"env", | ||
"__pycache__", | ||
"dist", | ||
"eggs", | ||
"wheels", | ||
"htmlcov", | ||
"docs", | ||
".ipynb_checkpoints", | ||
".vscode", | ||
".github", | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, makes me wonder if we should really just ignore them by default (like we do with .git and .hg), and if some user really needs these, he could !something
in dvcignore to include it.
Or, we might consider generating a default .dvcignore (kinda like github generates .gitignore when you create a project) to make it extra explicit. Though this is simple on init
, but managing this stuff in existing projects is not too simple 🙁
I'm just worried about these warnings getting too annoying. Plus, unless I'm missing something, ignore execution order is not deterministic (that's actually is a bigger problem for us in general), so you never know if this filter will be executed before or after .dvcignores in which user has ignored a special dir explicitly. 🙁
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more thing that this makes me think about is having some kind of safe guard, that would, for example, record the amount of time it took to walk through the directory and warn if it didn't find anything but it took, say, more than 5 sec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This issue was opened after this conversation #3257. It was decided that it is better to warn the user if a special directory is found than to ignore it by default. But annoyance from warnings can be a problem. Maybe we should make it possible to disable this warning? From the other hand, the default .dvcignore also good idea because the ignore will be explicite to the user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What makes idea of default .dvcignore
compelling, is the fact that is not only explicit, but also editable by user.
If we decide to go with warnings, they definitely need turn off mechanism, just like we did with slow_link_guard
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now I soppouse that the default .dvcignore
, is better option. It`s more explicite and easy to configure. Also the list of special dirs always will be incomplite, we can not add all posiible variants here. And default .dvcignore
provide additional example for users how to deal with such cases.
@efiop, @pared, if you are agree with me, then I will change the code. Should I close this PR and open a new one later, or commit right here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 I love the idea of an explicit .dvcignore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder whether it should be default. Maybe we should add some option to init, or dedicate special command for that? It seems to me that .dvcignore
was supposed to let user decide what is part of his project. If we always create default .dvcignore
on init, we make it our choice. @efiop @aerubanov, what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that default .dvcignore
should consist comment with link to the dvcignore documentation. And we should add option for init to filling .dvcignore
with some content. Not all list of special folder, rather based on type of project specified by user. Something like github gitignore template https://github.com/github/gitignore, but with fewer options ofcorse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aerubanov I think you are right on this one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing stuff @aerubanov ! Thank you! 🙏 Please see a comment above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure whether logic of finding potentially problematic dirs
should be part of dvcignore
.
Introducing this change in this form will cause us to check all directories that we visit during traversing.
That might have considerable impact for all users, while it will be helpful for a subset of them.
Maybe we should be doing this check only in a directory that also contains .dvc
? It's not the perfect solution, but it would not impact each walk
operation.
dvc/ignore.py
Outdated
|
||
def __call__(self, root, dirs, files): | ||
for d in dirs: | ||
if d in self.special_dirs: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could do something like set(dirs).intersection(self.special_dirs)
and print single warning for all problematic dirs.
I add default .dvcignore with link on docs and option for init to filling .dvcignore with some content based on project type specified by user. If there is no need for further changes to the functionality, I will add tests and changes to the documentation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM @aerubanov
@shcheklein? @efiop?
dvc/ignore_templates/__init__.py
Outdated
@@ -0,0 +1,2 @@ | |||
# templates for .dvcignore files |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't really have to be a module, as there is no actual code. You can either keep it as datafiles or maybe actually make these into the python shape by declaring these as strings. The latter way is easier to manage, as things like setup.py and pyinstaller will have a much easier time collecting these automatically.
dvc/ignore_templates/java
Outdated
@@ -0,0 +1,23 @@ | |||
# Compiled class file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I also wonder if these are really worth stealing directly from https://github.com/github/gitignore . They would be hard to maintain, really. But I'm not able to see a better way right now...
dvc/repo/__init__.py
Outdated
no_scm=no_scm, | ||
force=force, | ||
subdir=subdir, | ||
ignore_template_list=ignore_template_list, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
People will probably want these even in existing projects, and they will probably forget to do this during the init phase. So we could add another command to install these, but maybe it is actually better to just document that people might want to get .dvcignore from https://github.com/github/gitignore and use it.
This also makes me wonder if we should should really somehow reuse existing .gitignores (e.g. to not search for dvc-files and .dvcignores in .gitignore'd directories). But I suppose many people also want to exclude these dirs from dirs they are dvc add
ing (or tracking), which doesn't play well, because we gitignore the data we are adding. 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@efiop, Ok. May be we should add command for downloading template content from https://github.com/github/gitignore and pasting it in .dvcignore
? Then, we need not to store it in dvc code base and maintain, and user need not to do it by hand.
And also we can add second option for this command to read existing ,gitignore
, exlude all records that was added by dvc add
and paste it in .dvcignore
. Is it possible to easily get a list of all tracked files? (I think it should be, I just don't know where to look).
What do you think of these two possibilities?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@efiop, Ok. May be we should add command for downloading template content from https://github.com/github/gitignore and pasting it in .dvcignore? Then, we need not to store it in dvc code base and maintain, and user need not to do it by hand.
@aerubanov I don't think it is worth creating a special command. It is not hard to just go to a link in the docs and copypaste contents to .dvcignore (or use wget/curl/etc).
And also we can add second option for this command to read existing ,gitignore, exlude all records that was added by dvc add and paste it in .dvcignore. Is it possible to easily get a list of all tracked files? (I think it should be, I just don't know where to look).
That sounds a bit complex :( We had a discussion with @pmrowla about it, and maybe we could just use .gitignore
directly, the only thing that is problematic is not dvcignoring dvc-tracked files, but that should be solvable by telling dvcignore about them when we collect the dag. We will soon have a pre-requisite PR that notifices dvcignore about those outputs for performace reasons and we might get back to this idea right after it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, i will make the necessary changes.
dvc/repo/init.py
Outdated
f.write( | ||
"# This is .dvcignore file." | ||
" See more https://dvc.org/doc/user-guide/dvcignore\n" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea to create a default dvcignore. It'll make people aware of it. 👍 to that.
Regarding the comment, better to explain what it does. This is .dvcignore file.
is clear from the filename.
Something like following, perhaps?
f.write( | |
"# This is .dvcignore file." | |
" See more https://dvc.org/doc/user-guide/dvcignore\n" | |
) | |
f.write( | |
"# add patterns of files dvc should ignore, can improve performance\n" | |
"# Learn more at https://dvc.org/doc/user-guide/dvcignore\n" | |
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@skshetry Please see my comment above. Maybe we should just use .gitignore directly. Would love to know what you think about that.
* Add link on templates for .dvcignore. Related with iterative/dvc#4229 * Restyled by prettier Co-authored-by: Anatoly <[email protected]> Co-authored-by: Restyled.io <[email protected]>
Thank you @aerubanov ! 🙏 |
Since iterative#4229, we have started creating .dvcignore file after `dvc init` This commit only adds that in assertions
Since #4229, we have started creating .dvcignore file after `dvc init` This commit only adds that in assertions
|
||
with open(dvcignore, "w") as fobj: | ||
fobj.write( | ||
"# Add patterns of files dvc should ignore, which could improve\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"# Add patterns of files dvc should ignore, which could improve\n" | |
"# Add patterns of files DVC should ignore, which could improve\n" |
Added a warning message when traversing special directories and a test for that. Also removed unused arguments in a couple of tests in tests/func/test_ignore.py